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Abstract 

We present a technique to characterize differentially expressed genes in 
terms of their position in a high-dimensional co-expression network. The set- 
up of Gaussian graphical models is used to construct representations of the 
co-expression network in such a way that redundancy and the propagation of 
spurious information along the network are avoided. The proposed inference 
procedure is based on the minimization of the Bayesian Information Criterion 
(BIC) in the class of decomposable graphical models. This class of models 
can be used to represent complex relationships and has suitable properties that 
allow to make effective inference in problems with high degree of complex- 
ity (e.g. several thousands of genes) and small number of observations (e.g. 
10-100) as typically occurs in high throughput gene expression studies. Tak- 
ing advantage of the internal structure of decomposable graphical models, we 
construct a compact representation of the co-expression network that allows to 
identify the regions with high concentration of differentially expressed genes. 
It is argued that differentially expressed genes located in highly interconnected 
regions of the co-expression network are less informative than differentially 
expressed genes located in less interconnected regions. Based on that idea, 
a measure of uncertainty that resembles the notion of relative entropy is pro- 
posed. Our methods are illustrated with three publically available data sets on 
microarray experiments (the larger involving more than 50,000 genes and 64 
patients) and a short simulation study. 
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1 Introduction 



The advent of modern functional genomics opened the possibility of studying in de- 
tails complex relational patterns between genes and phenotypes. The simultaneous 
access of expression levels of several thousands of genes can be ro utinely determ- 
ined by transcriptomi c , prot eomic and metabolomic technologies (|jourdan et al. . 



2010. lYohannes et al.L |2008|) . Although a large amount of this type of data has 



recently been produced, a significant portion of this inf ormation might be redund- 
ant be cause gene expression levels tend to be correlated dPorog ovtsev and MendesL 



20021) . The present work describes techniques to characterize the structure of inter- 



dependency between the levels of gene expressions and use this characterization to 
prune part of this redundancy from large scale gene-expression data (e.g. microar- 
rays and alike techniques). The general idea is to find a simultaneous representation 
of the values of the expression levels of a large number of genes (co-expression net- 
work) and then take advantage of the internal structure of this network to identify 
groups of genes that are potentially informative. 

The following scenario illustrates our ideas. Suppose that the expression levels 
of several thousands of genes are determined in a number of samples (e.g. a mi- 
croarray experiment). Some of these samples are from diseased and some from 
healthy tissues. Typically, a list of genes with expression levels presenting statist- 
ically significant differences between diseased and healthy tissue is produced. In a 
naive approach, this list is then directly used to search for patterns of associations 
between gene regulation and the ti ssue's disease status. Hovyever, s ince gene ex- 
pression levels are often correlated (IDorogovtsev and Mendesll2002|) . part of those 
associations might be spurious. For example, there might exist a group of genes, 
say A, for which their expression levels are changed in diseased tissues due to the 
direct effect of the disease. Associations between genes in group A and the effect of 
the disease status are genuine. On the other hand, there might exist another group 
of genes, say B, for which their expression levels are associated to the expression 
levels of the genes in group A due to general gene regulatory mechanisms, but for 
which the disease has no direct effect. The expression levels of the genes in group 
B would be associated with the expression levels of genes in group A anyway, in- 
dependently of the disease status of the tissue. In this case, observed associations 
between the expression levels of the genes in group B and the disease status of the 
tissue are spurious. Here the problem is that usually it is not possible to determine if 
a differentially expressed gene belongs to a group of genes like group A or group B 
and therefore it is not possible in general to establish whether a declared association 
is spurious or not. 

The discussion of the scenario above suggests that it is important to take into 
account the global structure of the joint expression levels of the genes in an exper- 
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iment (i.e. the structure of the co-expression network) to identify groups of dif- 
ferentially expressed genes that are alike to present a genuine association with the 
effect of a perturbation (e.g. the disease status of a tissue). Associations between a 
perturbation and the expression levels of genes cannot be disentangled if the genes 
are located in a very interconnected region of the co-expression network. Con- 
sequently, it is not possible to determine whether observed patterns of differential 
expression involving those genes are spurious or not. On the other hand, associ- 
ations involving differentially expressed genes located in less interconnected re- 
gions of the co-expression network are less entangled and have a larger potential 
to represent genuine associations. Another important aspect is that the relative po- 
sition of a gene in the co-expression network might affect the interpretation of the 
results. The differential expression of informative genes located in a central area of 
the network might suggest alterations in basic vital functions. On the other hand, the 
differential expression of genes located in the periphery of the network might indic- 
ate alterations associated to specific aspects related to the perturbation performed. 
This is a consequence of the flux of information in the co-expression network (see 
the discussion). In summary, the two aspects of the interpretation of the differen- 
tial expression of genes discussed above can only be assessed if the position of the 
differentially expressed genes in the network is taken into account and not only the 
fact that a gene is differentially expressed or not. 

The characterization of a large gene co-expression network is not an easy task 
since it requires the simultaneous representation of the expression levels of several 
thousands of genes, possibly forming rather complex patterns of interdependency. 
We propose here to use the set-up of graphical models ( Lauritzen . 1 1 996i. Whittakerl 
[1990). In this approach the network of co-expression is characterized by an undir- 
ected graph where the vertices (i.e. points in a graphical representation) represent 
the genes in study. Two vertices (i.e. two genes) are connected by an edge (i.e. 
a line in a graphical representation) when their expression levels are conditionally 
correlated given the expres s ion le vels of all the othe r gene s in the network (see 

( 2010 ) for similar propo sals 



BoUob as (2000l). lLabouriaul(l2000h and lEdwards et al 



specific to genetics and gene expression and lWhittakeii (1 19901) . iLauritzenl (1 19961) for 



pr 



the general theory of graphical models). The use of conditional correlations is the 
first step to avoid that redundant information is represented in the network. How- 
ever, as it will be clear from the examples presented here, graphical models yield 
representations of the co-expression network that are still too complex for the pur- 
pose of identification of general patterns of differential expression. Therefore we 
propose an algorithm that uses the internal structure of a rich class of graphical mod- 
els (the decomposable models) to obtain a tight representation of the co-expression 
network. This compact representation will yield a better visualization of the dis- 
tribution of the differentially expressed genes along the co-expression network and 
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will allow us to define a measure of informativeness for differentially expressed 
genes. 

The methods informally described above involve a difficult problem of statist- 
ical inference. The type of gene expression data we have in mind typically contains 
few observations (say 10 to 100) and a much larger number of variables (10 to 60 
thousand genes) which renders infeasible the use of standard techniques for stat- 
istical inference for graphical models. We will describe a way to circumvent this 
problem by basing the inference on the minimization of the BIC (Bayesian Inform- 
ation Criterion) and by restricting the class of graphical models to the sub-class of 
decomposable graphical models. There exists a rather efficient algorithm already 
implemented for making statistical inference for hig h dimensiona l decom posable 
graphical model s (the gRapHD package des c ribed in lAbreu et al.l (|2010l) and im- 
plemented in R, Ir Development Core TeamI ( 20101) ). This algorithm can also be 
applied to other similar inference techniques {e.g. the minimization of the AIC 
(Akaike Information Criterion) or maximization of the entropy) although it can be 
shown that the minimization of the BIC yields consistent and optimum estimates 
under rather general conditions (see the discussion). Moreover, as briefly repor- 
ted below, we improved the algorithm of minimization of the BIC by implement- 
ing a more sophisticated technique to characterize decomposable graphical models 
which takes advantage of the internal structure of the network. This allows us to 
treat typical throughput data on gene expression using a reasonable amount of com- 
putational resources. 

The rest of this work is structured as follows: Section [2] briefly revises the basic 
theory of graphical models, describes an efficient procedure for model inference, 
places the problem of co-expression networks in the context of graphical models 
and proposes a general algorithm for clustering genes in a co-expression network. 
Applications of the proposed techniques in three publically available data sets and 
a short simulation study are presented in Section |3l Section |4] presents a brief dis- 
cussion and some concluding remarks. 



2 Methods 



2.1 Co-expression Network viewed as a graphical model 

We construct a gene co-expressi on network using the framework of graphical mod- 
els (^ Lauritzeii , ll996L IWhittakeii [l990 ). In this approach, the expression levels of 
genes (or probes) are represented by random variables that are the vertices (points) 



in a graph. More precisely, let vi 



be the random variables representing the 



expression levels of the p genes under study. We will use the terms variable and 
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vertex interchangeably. The network is characterized by the graph G=(V,E), where 
V = {vi, ...,Vp} is the set of vertices and E is a set of unordered pairs of vertices 
called edges. Two vertices, v, and vj, are adjacent or directly connected by an edge 
when the conditional correlation between v, and Vj, given the other vertices, is not 
zero. In this way, two vertices are adjacent in the graph if, and only if, they carry 
information on each other that is not already contained in the other vertices. The 
absence of an edge directly connecting two vertices, say v, and Vj, indicates that 
the information that carries on Vj (and that Vj carries on v,) is entirely contained 
in the other vertices of V. The graph G=(V,E) is undirected, i.e. if vt is adjacent 
to Vj then vj is also adjacent to vu since the conditional correlation is symmetric. 
Here G=(V,E) is unweighted, in the sense that there is no meaning attributed to the 
length of the edges in any graphical representation of G. 



2.1.1 Basic notions of graph theory 

We introduce below a range of required definitions and notations on graphs. Given 
an undirected graph G=(V,E), a sequence vi,...,Vyt of distinct vertices such that 
{v,-, v,_|_i} is in E, for z = 1, — 1 (k < p), is called a path. Two vertices, v,- and 
Vj are connected when there is a path from v/ to vj. A connected graph has all 
pairs of vertices connected. We reserve the term directly connected (or adjacent) to 
indicated the existence of an edge between two specific vertices. 

A cycle is a path with equal end points. A connected graph with no cycles is 
called a tree and a graph composed of several trees is a forest. A cycle is chordless 
when only successive vertices are adj acent. A graph w ith no chordless cycles longer 



than three is said to be triangulated ( BoUobas . 20001) . Clearly, trees and forests are 
triangulated. 

A sub-graph of G=(V,E) is a graph Ga = (Va,Ea) such that Va C V and 
Ea C E. If V = \a we call Ga a spanning subgraph of G. Any spanning subgraph 
of G=(V,E) can be obtained by deleting some of the edges from E. A spanning sub- 
graph with no cycles is called a spanning tree. Note that a graph has a spanning tree 
if, and only if, it is connected. A maximal spanning forest, or in short a spanning 
forest, in G is a spanning graph consisting of a spanning tree from each connected 
component of G. 

A graph G=(V,E) is complete when E contains all possible pairs of vertices 
in V. A clique is a maximally complete sub-graph, in the sense that the addition 
of any other vertex renders the sub-graph incomplete. A subset C C V separates 
two disjoint subsets of V, A and B, if any path f rom a vertex in A to a vertex in B 



contains at least one vertex in C ( BoUobasl 12000 ). 



We introduce next a convenient representation of any triangulated graph which 
will be crucial for describing the statistical inference and the reduction algorithm 
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Figure 1: A graph with 11 vertices, 20 edges, and 6 cliques: Ci = {1,2,3}, C2 = 
{2,3,8, 10}, C3 = {8,9, 10}, C4 = {4,9, 10}, C5 = {8,9, 11}, and = {2,5,6,7}. 



proposed. Let G=(V,E) be a triangulated graph and Ci , . . . , Q be an enumeration of 
all the cliques of G. Consider the sets Si,..., St given by. Si = and for 7 = 2,..., k. 



0-1 
[jc, 



The sets Si, ... ,5^ are called the separators of the sequence of cliques. That is, a 
separator Sj is given by the intersection of Cj with all previous cliques Q, / < j. 

According to the classic theory of graphs, the graph G is triangulated if, and 
only if, there exist an enumeration of its cliques such that each of the separators 
5i , . . . , 5;(- a r e cornplete and, for eac h i in { 1 , . . . , fc} there is a j < i such that Si C Cj 
( Golumbic . 1980 . Lauritzen . Il996h . One such enumeration is called a perfect se- 
quence of cliques and will be used to perform efficient recursive calculations ne- 
cessary in inference procedures involving triangulated graphs. There are several 
efficient algorithms to find in linear ti me a perfect sequence o f a trian gulated graph, 
as the maximu m cardinali ty search dTarjan and Yannakakis . 1984 h and the lexo- 
graphic search ( Rose et al] .Tl976). For example, in the graph represented in Fig- 
ure[T](A) one perfect sequence is: Ci = {1,2,3}, C2 = {2,3,8, 10}, C3 = {8,9, 10}, 
C4 = {4,9, 10}, C5 = {8,9, 11}, andCs = {2,5,6,7}, with the respective separators 
5i = 0, 52 = {2,3}, 53 = {8, 10}, 54 = {9, 10}, S5 = {8,9}, Se = {2}. 



2.1.2 Statistical inference of networks 

It is assumed that the observations are independent realizations of a multivariate 
normal distribution. Let Vi, . . . , v„ be n independent observations representing the 
expression levels of p genes and assume that 

v,~A^p(jU,E), fori= l,...,n. 
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where Np{- , • ) represents the p-dimensional multivariate normal distribution. Here 
the covariance matrix E is understood as the parameter of interest and the mean 
vector ju is considered as a nuisance parameter, since our main concern is to char- 
acterize the covariance structure. If we consider a subset of the p variables in v/, 
vf = {v/, vf}, with k < p (i.e. if k = p then = v,), then the marginal density 
function of is given by 

f(yk) = I -^(vf-M.)'2,-'(vf-M.) 

(2;r)V2[det(E,)]i/2'^ 

where T.^ and jU^ are respectively the covariance matrix and mean vector of . 

A graphical model is called decomposable if the related graph is triangulated. 
The likelihood function of any decomposable graphical model can be factorized 
according to its cliques and separators. This greatly simplifies the calculation of 
the quantities related to the likelihood function. For instance, the log-likelihood 
function can then be written as 

/(E,/l|y) = £ £ ln/(v?)-i; £ v(5,)ln/(v^), 

where v = {vi, v„}, ^ and y are respectively the sets of cliques and separators 
of G, Cj is a perfect enumeration of the cliques in ^ with respec tive separators Sj 



and v{Sj) is the multiplicity of the separator 5^ in =5^, j = 1 , . . . , A; (|LauritzenL Il99 



The Bayesian Information Criterion, BIC(G), can then be calculated as 

BIC(G) = -2/(E,jlt|v) + Kln(n), 

where K is the number of estimated parameters in the model, given by 2jc» + |E| and 
|E| is the cardinality of the set of edges. 

The classic method for likelihood-based statistical inference under Gaussian 
graphical models (covariance selection models) involves the inversion of the sample 
covariance matrix. This technique cannot be directly applied here because typically 
gene expression data contains a much larger number of variables (the genes, p) than 
observations (indiv iduals or samp les, n), implying that the empirical covariance 



matrix is singular (|Dykstral . 11970 ). For this reason we propose to infer the co- 



expression network by finding the decomposable graphical mod el with mi nimum 



BIC. This technique yields consistent and optimum estimates (see Haughtonlll98 8). 



although our methods could easily be adapted to other similar inference techniques 
(e.g. the minimization of the AIC (Akaike Information Criterion) or maximization 
of the entropy). 

The search for a graphical model with minimum BIC is made in two steps: 
First the spanning forest with minimum BIC is found and then a forward search is 
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performed by successively adding edges that reduce the BIC and still preserve the 
decomposability of the graph. The spanning forest with minimum BIC is construc- 
ted in the following way. Starting with a graphical model containing no edges, the 
edges that decrease mostly the BIC among those edges whose addition does not 
generate cycles in the graph are successively added. Here the improvement in the 
BIC due to the addition of the edge {v,-, v^}, /{v,,v^}' is given by: 

where v,- G A, vj E B, and S separates A and B in the current graph. The sets A and 
B are cliques in the current graph. The constant K" is the number of fr ee parameters 



estimated, and n is the number of observations (number of arrays) (lEdwards et al. 



2010h . The algorithm is: 



Forest search 
begin 

1. V = {vi,...,Vp} andE = 

2. G=(V,E) 

3. for every {v/,vj}, i^j 

4. calculate /{v,,vy} store in F 

5. while |r| > and G is a forest 

6. select e = {v,-, vj} with minimum /{v,,v;} i^i F 

7. if I{vj.vj} < and E U e does not generate a cycle 

8. add e to E 

9. remove e from F 
end 

Once the spanning forest with minimum BIC is found the search continues by 
successively adding edges that further decrease the BIC, as described in the al- 
gorithm below. The algorithm takes advantage of the fact that the addition of an 
edge changes the graph locally, thus only the values of /{.,.} for vertices in the 
altered region need to be re-calculated. Denote by ^{G) the class of edges that can 
be added to a decomposable graph G preserving its decomposability. The algorithm 
is the following: 



8 



Decomposable search 
begin 

1. V={V1,...,V;,} 

2. G = {V,E} 

3. stop = false 

4. while not stop 

5. find ^(G) 

6. update the values of ^{v,.vj} 

7. select e = {v,-, vy } with minimum /|,,.^ . j 

8. if/Kv,}>0 

9. stop = true 

10. else 

11. addetoE 
end 

The step 5 in the alg orithm above - find ^(G) - is typically rather time consuming 
if impl emented as in Abreu et al. ilOl &i. Therefore we used the technique sugges- 



ted by iDeshpande et al.. (,200 1|) : Define the clique graph of a decomposable graph 
G=(V,E) as a new graph G"^*, in which the vertices are the cliques of G, and two 
vertices, v^^ and v^* (representing cliques C; and Cy of G), are directly connected 
by an edge if and only if the set C,- fl Cj separates Ci\{Ci fl Cj) and Cj\{Ci fl Cj) in 
G. An edge {va,Vft} is in S'{G) if, and only if, Va G Q and G Cj, given that the 
edge {Ci,Cj} is in G^'^. 

2.2 An algorithm for clustering genes 

We describe below a method to construct a compact representation of the distribu- 
tion of differentially expressed genes along a complex co-expression network. We 
presuppose that the genes in study can be classified in two categories: differentially 
expressed (DEG) and non-differentially expressed genes (NDEG). 

Suppose that the graph G=(V,E), representing the co-expression network, can 
be split in k connected sub-graphs, say Ci , . . . , Q, termed the components of G. 
Each of these components can be classified as: differentially expressed gene dense 
(DEGD) or non-differentially expressed gene dense (NDEGD); if the proportion of 
DEG in the component exceeds or not a pre-fixed threshold a, respectively. Here 
the proportion of DEG in V is a natural choice for a. The idea of the algorithm 
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below is to merge the components that are neighbours in the graph G and have 
the same classification, producing in this way a collection of subsets of V which 
elements are called gene clusters. We can then construct a second order graph 
Ga: = (K,Ek) for which the vertices are the gene clusters and two clusters, and 
K2, are adjacent when there exist two genes vi E Ki and V2 G K2 such that vi and 
V2 are adjacent in G=(V,E). The graph Gk = {K,Ek) is called the cluster graph. 
When the representation of the co-expression network is a triangulated graph, it is 
natural to construct the graph components using the cliques. 

Let G=(V,E) be a triangulated graph representing the co-expression network for 
which the vertices can be classified as DEG or NDEG. The algorithm below will 
find the gene clusters and output the associated cluster graph: 

Clustering 
begin 

1 . a ^ proportion of DEG in V 

2. find all cliques Ci , . . . , Q of G=(V,E) 

3. for each Cj, i= 1 , . . . , 

4. Uj ^ proportion of DEG in C, 

5. classify Q as DEGD if a, > a 

6. otherwise classify as NDEGD 

7. for J in {DEGD, NDEGD} 

8. Gj ^ sub-graph of G=(V,E) for all vertices in 7-classified cliques 

9. each connected component of G^ is a j-dense cluster 
10. Form the cluster graph Ga: = (K, E^-) 

end 

Figure |2] (A) shows an example with the two distinct groups of vertices: DEG 
containing the genes labelled 5,6,7,9, 10, 11 (represented in red) and NDEG con- 
taining the genes labelled 1,2,3,4,8}, (represented in blue). The graph is trian- 
gulated with cliques given by Ci = {1,2,3}, C2 = {2,3,8, 10}, C3 = {8,9, 10}, 
C4 = {4,9,10}, C5 = {8,9,11}, and Cg = {2,5,6,7}. If the threshold a is %i , 
the cliques C3, C4, C5, and Q are classified as DEGD. As Cg is not directly con- 
nected to other DEGD clique, it solely composes the cluster A = {2,5,6,7}. The 
cliques Q and C2 are directly connected, forming the cluster B = {1,2,3,8, 10}. 
The cluster C = {4, 8, 9, 10, 1 1} is formed by the cliques C3, C4, and C5, all directly 
connected and DEGD. Solving the separators, the final clusters are A = {2,5, 6,7}, 
5 = {1, 3, 10}, and C = {4, 8,9, 11}. Having the vertices clustered in this way, the 
graph can be drastically reduced to the cluster graph displayed in Figure |2](B). 
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(A) 

(B> 

Figure 2: (A) A graph representing DEG (in red) and NDEG (in blue); (B) The 
cluster graph of (A), where each dot now represents a gene cluster (DEGD in red 
and NDEGD in blue). 

2.2.1 A measure of the level of uncertainty of a gene 

The following basic idea will be used to construct a measure of the level of uncer- 
tainty of a gene. Suppose that we establish that a gene, v is differentially expressed. 
We argue that if v is located in a highly interconnected region of the network con- 
taining many other differentially expressed genes, then to say that v is differentially 
expressed essentially does not reduce the uncertainty that we had before the obser- 
vation is done. This occurs because the expression levels of many other differen- 
tially expressed genes are associated to the expression level of v. On the other hand, 
if the interconnected region of the network where v is located contains few (or in the 
extreme case, none) other differentially expressed genes, then declaring v differen- 
tially expressed reduces significantly the uncertainty that we had before observing 
the data; thus suggesting that the informational content of such declaration is high. 

The starting point of the construction below is a segmentation of the network. 
We assume that the genes of the network are classified in disjoint subsets of genes 
constructed in such a way that the expression levels of the genes located in the same 
subset are associated. One way to construct such a stratification is by joining con- 
nected cliques in a decomposable graphical model as proposed in section [Z2l Since 
the gene clusters produced there are not necessarily disjoint, we adopt the following 
convention: if a vertex v is in the intersection of two gene clusters, then v is moved 
to the cluster that has the opposite classification (with an arbitrary choice in the 
ambiguous cases). This choice minimizes the number of clusters falsely classified 
as DEGD. A reciprocal convention would maximize the number of cliques truly 
classified as DEGD. 

The algorithm defined there joined neighbouring cliques with predominance of 
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differentially expressed genes and neighbouring cliques with predominance of non- 
differentially expressed genes. The proposed measure of uncertainty for a cluster K 
is given by 



where riK is the number of differentially expressed genes in K and p is the total 
number of genes in the network. The measure po resembles the classic entropy 
measure of uncertainty, the larger is po the larger is the uncertainty. Since Po{K) is 
a relative measure of uncertainty, it is standardized by dividing po{K) by the larger 
Po found in the network in study, say Po;max, defining in this way the a relative 
measure of uncertainty by 



We attribute the measure of uncertainty Po{K) to each differentially expressed gene 
in the cluster K. 

3 Applications 

3.1 Three examples 

In order to illustrate the proposed methods we present three examples represent- 
ing rather different situations. In the first example - human healthy and diseased 
gingival tissues - we study a relatively large network (54,675 vertices). This ex- 
ample illustrates how difficult it might be to identify patterns in the distribution of 
differentially expressed genes in large co-expression networks by directly examin- 
ing a decomposable graphical model. However, applying the proposed clustering 
procedure allowed us to identify regions of higher concentration of differentially 
expressed genes and to visualize general patterns. 

The second example - muscle water holding capacity in pigs - illustrates a situ- 
ation were most of the differentially expressed genes are located in less intercon- 
nected peripheral regions of the network and just a small proportion of the differ- 
entially expressed genes presented high uncertainty. An opposite situation occurs 
in the third example - intra muscular fatness in pigs - where most of the differen- 
tially expressed genes are located in a large central cluster and just few genes had 
low uncertainty. The first example occupies an intermediary position, presenting 
differentially expressed genes with high, intermediate and low uncertainty. 




p=p{K) 



Po{K) 



Po,max 
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Example 1 - Human healthy and diseased gingival tissues 



T his example arises from a microarray experiment analysed and described in details 



in 



Demmer et al.l (l2008l) . In this study, biopsies from diseased and healthy gingival 



tissues were collected from 90 patients and hybridized to individual arrays (Human 
Genome U133 Plus 2.0, Affymetrix) with 54,675 probe sets in each array. The data 
is publicly available at the Gene Expression Omnibus (accession GSE10334). 

Here we selected the sub-sample of the 64 patients for which one biopsy of 
healthy tissue and at least two biopsies of diseased tissues were available. This 
yielded a data set with 194 arrays (130 from diseased sites). Next, each probe was 
classified as differentially expressed or not using the following procedure: The ex- 
pression of each probe was studied using a gaussian linear mixed model containing 
a random component taking the same value for each patient and a fixed effect rep- 
resenting the tissue status (healthy or diseased). The significance of the fixed effect 
was tested by comparing the likelihood ratio statistics to the quantiles of the empiric 
distribution of 1,000 parametric bootstrap samples drawn under the null hypothesis 
(absence of effect of tissue status). After applying a false discovery ratio correc- 
tion for multiple comparisons the probes were classified as differentially expressed 
(DEP), i.e. presenting statistically significant effect of the tissue status (p < 0.05), 
or not differentially expressed (NDEP). We found 23,773 differentially expressed 
probes (i.e. around 43%). 

The co-expression network was inferred using the data of healthy sites. The 
spanning forest with minimum BIC presented only one connected component. The 
forward search for a decomposable graphical model resulted in the addition of 
24,735 edges, producing a network where 37% of the vertices were leaves (i.e. 
vertices with degree 1) and the maximum vertex's degree was 131. Figure [3] (A) 
displays the inferred network. It is not possible to identify any pattern in this rep- 
resentation of the network because the complexity of the inferred decomposable 
graphical model is too high. On the other hand, this complexity was drastically 
reduced by applying the clustering procedure, as showed in Figure[3](B). We found 
5,081 clusters among which 2,378 (i.e. around 47%) were classified as contain- 
ing predominantly DEP. Figure |4] (A) displays the smallest tree containing the ten 
largest clusters and Figures |4](B)-(D) show three of those clusters with their closest 
neighbouring clusters. 
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Figure 3: Co-expression network of Example 1. (A) Raw representation of the 
decomposable graphical model with minimum BIC; red and blue points represent 
differentially expressed and non-differentially expressed probes, respectively. (B) 
Network representation obtained by the clustering procedure; each point represents 
a cluster, which size is proportional to the number of probes in the cluster; clusters 
with predominance of differentially expressed probes are marked in red and the 
others in blue. 
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Figure 4: Details on the co-expression network of Example 1. (A) The smallest 
tree connecting the ten largest clusters; the sizes of the points are proportional to 
the number of probes in the corresponding clusters; clusters with predominance of 
differentially expressed probes are marked in red and the others in blue. (B) - (D) 
Detailed view of the three largest clusters designated by the respective letters in (A). 
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The largest cluster encountered (see Figure |4] (B)) contained 15,056 probes 
(around 28% of the total number of probes) and occupied a central position in the 
network (see Figure |4] (A)). The level of uncertainty of the DEP in this cluster is 
high, since the cluster contained 44% of the DEP. However, examining Figure[3](B) 
and the distribution of the uncertainty index p (see Figure |7]), it turns out that the 
network contained also DEP with intermediate and low levels of uncertainty. 



Example 2 - Muscle water holding capacity in pigs 



This examp le stems from a s t udy on muscle water holding capacity (WHC) in pigs 
reported by IPonsuksili et al.l (l2008b . The global pattern of gene transcription, as- 
sessed by Affimetrix Porcine Genome arrays containing 24,123 probe sets, and the 
WHC of the muscle longissimus dorsi were determined in 74 animals. The data is 
publicly available at the G ene Expression Omnibus (accession GSE10204). 

Among other analyses. IPonsuksili et a l. (2008) classified the probes as being as- 
sociated or not to the WHC in the following way: A probe was declared associated 
to the WHC (AP) when the absolute value of the Pearson correlation coefficient 
between the WHC and the logarithm of the expression intensity of the probe ex- 
ceeded 0.37 and the p-value for testing the significance of that Pearson correlation 
coefficient was (after accounting for multiple comparisons) smaller than 0.001 (cor- 
responding to a q-value smaller than 0.004). Using this criterion 1,279 probes were 
classified as AP. We characterize here the distribution of the AP probes along the 
co-expression network. 

The co-expression network of the 24,123 probes, inferred as the decomposable 
graphical model with minimum BIC, presented a single connected component with 
34,842 edges. The highest vertex's degree was 69 and around 37% of the vertices 
were leaves. Applying the proposed clustering procedure resulted in 1,521 clusters; 
451 classified as having predominantly AP. The largest cluster (with 12,551 probes) 
had predominantly probes not associated to the WHC. Examining the distribution 
of the uncertainty index p of the AP (see Figure |7]) it turns out that in this example 
there is a large amount of probes with low level of uncertainty. This is also apparent 
from the general structure of the network displayed in Figure [51 

The assumption of decompos ability of the model was verified in a subset of the 
probes. We selected the 60 probes with highest variance (40 non-AP and 20 AP), 
calculated the likelihood function evaluated at the maximum for a complete model 
(using the inverse of the sample covariance matrix), and compared that with the 
likelihood function of an estimated decomposable graphical model. The likelihood 
ratio test yielded a p-value close to one. Note that this procedure was possible 
because we had 60 variables and 74 observation. 
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Figure 5: Co-expression network of Example 2 using the cluster representation. 
Each point represents a cluster, which size is proportional to the number of probes 
in the cluster. Clusters with predominance of probes associated to the muscle water 
holding capacity are marked in red and the others in blue. 



Example 3 - Intra-muscular fatness in pigs 

This exa mple arises frqni a st udy on muscle transcriptomic profiles in pigs de- 



scribed in lCanovas et al.l (|2010|) . Samples from the muscle gluteus medius of 68 an 
imals were collected and hybridized to Affymetrix GeneChip Porcine Genomic ar 
rays with 4,299 probe sets. The animals were selected from two contr asting extreme 



phenotype groups: with high or with low intramuscular fat contents. ICanovas et al. 



bund 1,060 probes presenting expression levels with significant differences 



between the two groups of animals. The data is publicly available at the Gene Ex- 
pression Omnibus (accession GSE19275). 

Here we inferred a co-expression network using the group of animals with high 
intramuscular fat contents and studied the distribution of the differentially expressed 
probes along this network. The decomposable graphical model estimated by min- 
imizing the BIC presented one single connected component with 5,438 edges; the 
highest vertex's degree was 42 and around 38% of the vertices were leaves. 
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Using our clustering procedure we obtained a representation of the network 
with 448 clusters, 132 of them presenting predominance of differentially expressed 
probes. This representation, displayed in Figure [6l presented a large central cluster 
(with 838 differentially expressed probes and in total 1,527 probes, i.e. around 
79% of the differentially expressed probes and 36% of the total of probes). The 
histogram of the measure p, displayed in Figure |7l shows that a large proportion 
of the differentially expressed probes presented a high level of uncertainty, which 
contrasts strongly with the scenario in Example 2. 




Figure 6: Co-expression network of Example 3. Each point represents a cluster, 
which size is proportional to the number of probes in the cluster. Clusters with 
predominance of differentially expressed probes are marked in red and the others in 
blue. 
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Figure 7: Histogram and density estimation (Gaussian kernel estimate, botton left 
plot) of the uncertainty coefficient (p) for the three examples. 



3.2 A short simulation study 

The performance of the proposed cluster-based representation of networks was 
evaluated using the simulation study described below. First we constructed a ref- 
erence data set by selecting the 1 ,200 probes with the highest variance in the data 
of Example 1 (Human healthy and diseased gingival tissues). This data set was 
then used to estimate a co-expression network following the procedure described 
in section 12.1.21 (see Figure [8]). Next we estimated the covariance matrix of this 
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decomposable graphical model by 



£ [{ssdc)-'] - £ v(5) [{ssds)-'] 



Here n is the sample size, ^ is the set of cliques and is the set of separators with 
multiplicity v in a perfect sequence and, for a given set of vertices A, ss dA denotes 
— We assume implicitly that n > maxce'^\C\ (see iLauritzen . 
19961). The covariance matrix E was then kept fixed and applied repeatedly to sim- 
ulate multivariate normally distributed observations to be used in the performance 
evaluations. 




(B) 



i 



k< 10 
k = 574 



Figure 8: Co-expression network of the reference model constructed with the 1,200 
probes with the largest variance in the data of Example 1. (A) Representation of 
the estimated decomposable graphical model; each dot represents a probe (differ- 
entially expressed in red, and non-differentially expressed in blue). (B) Represent- 
ation based on the clustering procedure; each dot represents a cluster, which size is 
proportional to the number of probes in the cluster; clusters with predominance of 
differentially expressed probes are marked in red and the others in blue. 
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The performance of the proposed procedure was then evaluated for 25 different 
sample sizes (from 10 to 250 increasing by 10). For each sample size we construc- 
ted 500 simulated data sets. For each of these data sets the uncertainty indices p of 
the probes were calculated and compared to the uncertainty indices obtained from 
the reference data set. Figure [9] displays the Monte Carlo estimates of the average 
mean square deviation of the uncertainty indices from the simulated reference. As 
expected the performance of the procedure improves substantially as the sample 
size increases. 




Sample size 

Figure 9: Monte Carlo estimate of the mean square error for the uncertainty index 
p for different sample sizes. 



4 Discussion 

We presented arguments indicating that it is essential to take into account the in- 
trinsic structure of interdependency of the expression levels of genes when discuss- 
ing and interpreting the possible association of patterns of differential expression 
and the effect of a perturbation. Our proposal is to use continuous graphical models 
to represent and infer this structure of interdependency. Since this representation 
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uses conditional correlations to define the set of edges in the graph, instead of dir- 
ect correlations as in relevance networks (IButte and Kohanel l2003h . we avoid to 
introduce redundancy and to propagate spurious information in the representation 
of the network. Indeed, with this representation, two genes are directly connected 
to each other if, and only if, the expression level of each of them carries informa- 
tion on the expression level of the other in such a way that this information is not 
alrea dy contained in t he expression levels of the rest of the genes in the network 
(see Whittaked (Il990 ) chapter 4 for an argument based on information theory). 
Moreover, graphical models have suitable mathematical properti es that are essen- 
tial for our const ruction. One example is the separation principle (jEauritzen, 1996, 
Iwhittakeiill990h which states that if two groups of variables, A and B, are separ- 
ated by a third group of variables, C, then A and B are conditionally non-correlated 
given C. Here the expression "group C separates A and B" means that every path 
connecting an element of A with an element of B necessarily contains an element 
of C. This basic property is essential for the interpretation of the relative position 
of a gene in the network. For instance, if a gene v (or a group of genes) is placed 
in the central part of the network and separates the network in two parts, A and B, 
then the knowledge of the values of the expression of v renders the two branches of 
the network independent. This means that the expression level of v carries all the 
information that the expression levels of all the genes in A carry on the expression 
levels of the genes in B (and vice-versa). Clearly, if the separation principle does 
not hold, then the position of a gene in the network looses interpretation. 

Graphical models, in special Gaussian graphical r nodels (or covaria nce selec- 
tion models) have been known for a long period (see bem pstei . 1972 ). however 
the use of these models to analyze large and complex gene-expression data has 
been limited. One of the reasons for this might be that throughput data on gene 
expression typically contains few observations (say 10 to 100) and a much larger 
number of variables (10 to 60 thousand genes), which makes the use of standard 
techniques for statistical inference for graphical models infeasible. For instance, 
the naive application of likelihood-based inference under the covariance selection 
model involves the inversion of the sample covariance matrix; however, it is known 
that if the number of variables exceeds the number of obse rvations, then th e sample 
covariance matrix is not invertible with probability one (iDykstral . Il970h . A way 
to circumvent this problem is to base the inference on alternative methods and, at 
the same time, to restrict the class of graphical models considered. We based the 
inference of the co-expression network on the minimization of the BIC (Bayesian 
Infor mation Criterion ), since this method yields consistent and optimum estimates 
(see Hau ghtonl Il988h . although our methods could easily be adapted to other sim- 
ilar inference techniques (e.g. the minimization of the AIC (Akaike Information 
Criterion) or maximization of the entropy) and restricted the class models used to 
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the Gaussian decomposable graphical models. 

The class of decomposable graphical models has suitable mathematical prop- 
erties that we take advantage of in three different ways: First, these models can 
be s eparated in small components, the cliques, which act somehow independently 
(see Golumbic , 1980) and which are used to construct the representation of the 
co-expression network needed for our methods. Secondly, these models allow 
for a special decomposition of the likelihood function in terms of the structure of 
cliques, which simplifies very mu ch the calculation s involved in some likelihood- 
based quantities such as the BIC (ILauritzenl Il996|) . Thirdly, there exists a rather 
efficient algorithm already implemented for making statistical inference for high 
dimensional decom posable graphical models (the ^RapHD package described in 



Abreu et al. ( lOloh and implemented in R, ( R Development Core Team . 2010 )). 



The recognition of the edges that can be added to a triangulated graph in such a 
way that the new graph is also triangulated is a key operation in this algorithm. 
The computational resources required for this increase rapidly with the dimension 
of the graph if naive methods of direct verification are used, thus forming a bottle- 
neck of the algorithm. Therefore we improved the algorithm by implementing a 
more sophisticated technique to characterize decomposable graphical models which 
takes advantage of the structure of the cliques in decomposable models. The class 
of decomposable graphical models is the largest class of graphical models for which 
the mathematical properties mentioned above hold. Indeed, the existence of the re- 
ferred decomposition of the models and of the likelihood quantities requires that 
the graph representing the model has a perfect enumeration, which is a necessary 
and su fficient condition for a graph to be triangulated (see theorem 4.3 in lOolumbid 
d 19801) '). 

An alternative form of inference of graphical models in general would be to 
minimize the BIC (or other penalized version of the likelihood function) by direct 
search or even by applying sophisticated Monte Carlo based meth ods for optim- 
ization (e.g. simulated annealing as in used in Thomas and GreenI (|2009) ). These 
methods would not be feasible for large and complex graphical models for two 
reasons: First, the calculation of likelihood related quantities for non-decomposable 
graphical models with more vertices than observations is not an easy task; secondly, 
the number of possible graphical models for which the BIC should be evaluated in- 
creases rapidly with the number of vertices. Inde ed, using the fact that the number 
of possible trees formed with n vertices is n"~^ (IGodsil. C. and Royle. G. ] dlOOO), 
corollary 13.2.2), it is easy to see that the number of possible arbitrary graphs that 
can be constructed even with relatively modest number of vertices are huge (e.g. the 
number of possible trees with only 100 vertices is of the order of 10^ ^^). Therefor e, 



restric tions on the type of graphs must be introduced; for exa mple Edwards et al 



(|2010|) considered only the sub-class of trees and forests while lAbreu et al. 
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considered the class of decomposable graphs. 

We claim that the restriction imposed by assuming decomposability of the graph- 
ical model does not seriously compromise the representation of the co-expression 
network. First, we can think on decomposable graphical models (and also spanning 
trees) as forming the skeleton of the (unrestricted) graphical model that adjusts the 
data. Chordless cycles of length larger than three might exist, but would be difficult 
to be detected with moderate sample sizes. Moreover, cycles of this type would 
not abound in a typical co-expression network since they would typically be asso- 
ciated with regulations of gene expressions in strict cascade in such way that the 
last gene in the cascade regulates the first. This cascade of gene regulation would 
be detected as a structure obtained by suppressing some edges from the cycle. For 
instance, making a local analysis of a subset of the probes in Example 2 we found 
using a likelihood ratio test that a saturated model {i.e. the model associated with a 
complete graph) could be reduced to the estimated decomposable graphical model 
indicating absence of chordless cycles of length larger than three in this sub-graph. 
Techniques of local investigations in restricted sub-graphs could be implemented in 
the future as a model control or search for larger regulatory cycles. 

The results of our simulation study indicated that the inference based on the 
minimization of the BIC performed well since the basic structure of the simulated 
network was recovered even for relatively small sample sizes. Moreover, the per- 
formance of the inference improved substantially when the sample size increased 
what is in accordance with the consistency of the estimation via the minimization 
of the BIC predicted theoretically. Furthermore, the algorithm implemented is effi- 
cient enough for inferring large co-expression networks, as illustrated by Example 
1 (54,675 vertices). 

Once having established a reasonable model for the gene co-expression net- 
work, we turned to the question of characterizing the distribution of DEGs (dif- 
ferentially expressed genes) along this structure. As illustrated in Example 1 (see 
Figure [3] (A)), the naive approach of simply marking the DEGs does not reveal any 
pattern due to the complexity of the graph. However, we were able to propose a 
much more compact representation by exploring the internal structure character- 
istic of any decomposable graphical model - the structure of cliques - to determine 
regions of the co-expression network where the DEGs are over represented. This 
yields a second order graph, the cluster graph, with vertices representing clusters 
of genes located in cliques that are contiguous in the co-expression network. Each 
cluster is classified, by construction, in one of two categories: presenting or not over 
representation of DEGs. The cluster graph can be thought as a graph with vertices 
with two colours. As illustrated in the example presented in Figure 2, the cluster 
graph yields a much more compact representation of the network as compared to 
the original co-expression network. The real example of diseased gingival tissues 
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illustrates how drastic can this reduction in the complexity of the representation be 
in real gene expression data. Indeed, the complexity of the interdependency graph 
representing the original expression levels of the 54,675 probes in the Example 1 
(diseased gingival tissues) is so high, that any pattern can be observed (see Fig- 
ure [3] (A)). It is even not possible to realize in Figure [3] (A), that the differentially 
expressed probes (marked in a different colour) are not homogeneously distributed 
in the network. On the other hand, the cluster graph of Example 1 involves only 
5,08 1 vertices (a reduction to less than 10%) and clear patterns of non-homogeneity 
in the distribution of DEGs can be observed (see Figure [3] (B)). For instance, ex- 
amining the cluster graph in details (see Figures |4](B)-(D)) it is possible to identify 
very large clusters containing predominantly DEGs and located in central areas of 
the network. The cluster graph contains also many small clusters located in the peri- 
phery of the network and presenting excess of DEGs. These two types of clusters, 
we argue, are of very different nature. While the expression levels of genes located 
in large central clusters cannot be disentangled from each other, the genes in the 
small peripheral clusters act essentially as isolated from the rest of the network. 
These fundamental qualitative differences in the DEGs cannot be identified if we 
use the original co-expression network. 

Once the co-expression network is stratified in non-overlapping contiguous re- 
gions (the clusters), it is natural to classify the DEGs genes according to the number 
of DEGs present in the cluster they belong to. The associations between the per- 
turbation made in a differential expression study (e.g. treatments, types of tissue, 
etc) and the expression levels of genes belonging to a cluster with many other DEGs 
cannot necessarily be attributed to a single gene or even a reduced number of genes. 
Those genes are considered less informative, since the knowledge that they are dif- 
ferentially expressed does not reduce very much the uncertainty that we had before 
performing the experiment. On the other hand, knowing that a gene is differen- 
tially expressed adds more information if the gene is located in a cluster containing 
only few DEGs. This principle is used when defining the information index p. We 
expressed p as the negative proportion of total number of DEGs represented by 
the number of DEGs in the cluster multiplied by the logarithm of this proportion, 
since this quantity resembles the classic entropy used to represent uncertainty in 
information theory. 

In conclusion, we devised a method to construct compact representations of 
co-expression networks that allows the identification of regions with high concen- 
tration of differentially expressed genes and genes with high informational content. 
We also presented an alternative method of inference of gene co-expression net- 
works based on high dimensional decomposable graphical models and an efficient 
algorithm that allowed us to treat typical throughput data on gene expression using 
reasonable amounts of computational resources. 
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